NSF PAR Search | NSF Public Access Repository

Dynamic Camera Poses and Where to Find Them

Rockwell, Chris; Tung, Joseph; Lin, Tsung-Yi; Liu, Ming-Yu; Fouhey, David; Lin, Chen-Hsuan (June 2025, CVPR)

Annotating camera poses on dynamic Internet videos at scale is critical for advancing fields like realistic video generation and simulation. However, collecting such a dataset is difficult, as most Internet videos are unsuitable for pose estimation. Furthermore, annotating dynamic Internet videos presents significant challenges even for state-of-the-art methods. In this paper, we introduce DynPose-100K, a large-scale dataset of dynamic Internet videos annotated with camera poses. Our collection pipeline addresses filtering using a carefully combined set of task-specific and generalist models. For pose estimation, we combine the latest techniques of point tracking, dynamic masking, and structure-from-motion to achieve improvements over the state-of-the-art approaches. Our analysis and experiments demonstrate that DynPose-100K is both large-scale and diverse across several key attributes, opening up avenues for advancements in various downstream applications.
Free, publicly-accessible full text available June 11, 2026
Multi-Object Hallucination in Vision Language Models

https://doi.org/10.52202/079017-1409

Chai, Joyce; Chen, Xuweiyi; Fouhey, David; Ma, Ziqiao; Qian, Shengyi; Xu, Sihan; Yang, Jianing; Zhang, Xuejun (December 2024, Neural Information Processing Systems Foundation, Inc. (NeurIPS))

Full Text Available
Reconstructing Hands in 3D with Transformers

Pavlakos, Georgios; Shan, Dandan; Radosavovic, Ilija; Kanazawa, Angjoo; Fouhey, David; Malik, Jitendra (June 2024, CVPR)

Full Text Available
LLM-Grounder: Open-Vocabulary 3D Visual Grounding with Large Language Model as an Agent

https://doi.org/10.1109/ICRA57147.2024.10610443

Yang, Jianing; Chen, Xuweiyi; Qian, Shengyi; Madaan, Nikhil; Iyengar, Madhavan; Fouhey, David F; Chai, Joyce (May 2024, IEEE)

Full Text Available
Towards A Richer 2D Understanding of Hands at Scale

Cheng, Tianyi; Shan, Dandan; Sultan, Ayda; Higgins, Richard; Fouhey, David (December 2023, NeurIPS)

Full Text Available
MOVES: Manipulated Objects in Video Enable Segmentation

https://doi.org/10.1109/CVPR52729.2023.00613

Higgins, Richard E.L.; Fouhey, David F. (June 2023, CVPR)

Our method uses manipulation in video to learn to understand held-objects and hand-object contact. We train a system that takes a single RGB image and produces a pixel-embedding that can be used to answer grouping questions (do these two pixels go together) as well as hand-association questions (is this hand holding that pixel). Rather than painstakingly annotate segmentation masks, we observe people in realistic video data. We show that pairing epipolar geometry with modern optical flow produces simple and effective pseudo-labels for grouping. Given people segmentations, we can further associate pixels with hands to understand contact. Our system achieves competitive results on hand and hand-held object tasks.
Full Text Available
EPIC Fields: Marrying 3D Geometry and Video Understanding

Tschernezki, Vadim; Darkhalil, Ahmad; Zhu, Zhifan; Fouhey, David; Laina, Iro; Larlus, Diane; Damen, Dima; Vedaldi, Andrea (December 2023, Advances in neural information processing systems)

Neural rendering is fuelling a unification of learning, 3D geometry and video understanding that has been waiting for more than two decades. Progress, however, is still hampered by a lack of suitable datasets and benchmarks. To address this gap, we introduce EPIC Fields, an augmentation of EPIC-KITCHENS with 3D camera information. Like other datasets for neural rendering, EPIC Fields removes the complex and expensive step of reconstructing cameras using photogrammetry, and allows researchers to focus on modelling problems. We illustrate the challenge of photogrammetry in egocentric videos of dynamic actions and propose innovations to address them. Compared to other neural rendering datasets, EPIC Fields is better tailored to video understanding because it is paired with labelled action segments and the recent VISOR segment annotations. To further motivate the community, we also evaluate three benchmark tasks in neural rendering and segmenting dynamic objects, with strong baselines that showcase what is not possible today. We also highlight the advantage of geometry in semi-supervised video object segmentations on the VISOR annotations. EPIC Fields reconstructs 96% of videos in EPIC-KITCHENS, registering 19M frames in 99 hours recorded in 45 kitchens, and is available from: http://epic-kitchens.github.io/epic-fields
Full Text Available
EPIC-KITCHENS VISOR Benchmark: VIdeo Segmentations and Object Relations

Darkhalil, Ahmad; Shan, Dandan; Zhu, Bin; Ma, Jian; Kar, Amlan; Higgins, Richard; Fidler, Sanja; Fouhey, David; Damen, Dima (November 2022, NeurIPS)

We introduce VISOR, a new dataset of pixel annotations and a benchmark suite for segmenting hands and active objects in egocentric video. VISOR annotates videos from EPIC-KITCHENS, which comes with a new set of challenges not encountered in current video segmentation datasets. Specifically, we need to ensure both short- and long-term consistency of pixel-level annotations as objects undergo transformative interactions, e.g. an onion is peeled, diced and cooked - where we aim to obtain accurate pixel-level annotations of the peel, onion pieces, chopping board, knife, pan, as well as the acting hands. VISOR introduces an annotation pipeline, AI-powered in parts, for scalability and quality. In total, we publicly release 272K manual semantic masks of 257 object classes, 9.9M interpolated dense masks, 67K hand-object relations, covering 36 hours of 179 untrimmed videos. Along with the annotations, we introduce three challenges in video object segmentation, interaction understanding and long-term reasoning. For data, code and leaderboards: http://epic-kitchens.github.io/VISOR
Full Text Available
COHESIV: Contrastive Object and Hand Embeddings for Segmentation In Video

Shan, Dandan; Higgins, Richard E.L.; Fouhey, David F. (January 2021, Advances in neural information processing systems)

In this paper we learn to segment hands and hand-held objects from motion. Our system takes a single RGB image and hand location as input to segment the hand and hand-held object. For learning, we generate responsibility maps that show how well a hand’s motion explains other pixels’ motion in video. We use these responsibility maps as pseudo-labels to train a weakly-supervised neural network using an attention-based similarity loss and contrastive loss. Our system outperforms alternate methods, achieving good performance on the 100DOH, EPIC-KITCHENS, and HO3D datasets.
Full Text Available
